Geospatial analysis is the ingestion, manipulation, display, and modeling of geographic and spatial data. The data can be represented explicitly as coordinates (latitude & longitude) or implicitly as addresses, Census tracts, or other identifiers.
This guide outlines the many tools in R for geospatial analysis and mapping. New tools are always being developed, so this guide will be occasionally updated as better methods emerge.
R can have a steeper learning curve than point-and-click tools for geospatial analysis like ArcGIS. The extra effort is worth the reward.
I hope this guide is a useful resource that helps you accomplish your existing research ideas or helps you come up with entirely new research ideas. Please don’t hesitate to contact Aaron Williams (awilliams@urban.org) if you have any questions about this guide or need any assistance with R.
Additionally, Rob Pitingolo (rpitingolo@urban.org) and the Urban Institute Mapping Users Group are excellent resources for mapping and geospatial analysis.
Most mapping in R fits the same theoretical framework as plotting in R using library(ggplot2). Hadley Wickham’s ggplot2 is based on Leland Wilkinson’s The Grammar of Graphics and Wickham’s A Layered Grammar of Graphics. The layered grammar of graphics is a structured way of thinking about the components of a plot, which then lend themselves to the simple structure of ggplot2 and ultimately mapping.
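As a minimal sketch of that layered structure (using the built-in mtcars data rather than geographic data), each component of a plot is an independent layer added with +:

```r
library(ggplot2)

# data and aesthetic mappings form the base layer; geoms, coordinate systems,
# and labels are each added as further layers
p <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) +
  geom_point() +
  labs(title = "Weight and fuel efficiency")

# a map is built the same way: swap geom_point() for geom_polygon() or
# geom_map() and add a projection with coord_map()
```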
The Urban Institute Data Visualization Style Guide is the best place to start when making any visualization. The guide offers some blunt suggestions for maps:
Just because you’ve got geographic data, doesn’t mean that you have to make a map. Many times, there are more efficient storyforms that will get your point across more clearly. If your data shows a very clear geographic trend or if the absolute location of a place or event matters, maps might be the best approach, but sometimes the reflexive impulse to map the data can make you forget that showing the data in another form might answer other—and sometimes more important—questions.
That said, some of the most convincing data visualizations are maps. The style guide has well-defined guidelines for standard chart and graph types, but some of the more complex or novel visualizations below will need to be styled and tweaked on a case-by-case basis. For external-facing products (presentations, publications, web interactives, etc.), you will likely need to collaborate with Urban’s Communications team. Please contact Ben Chartoff (bchartoff@urban.org) with questions or thoughts related to visual presentation, or discuss with the editorial team in the course of editing your publication/product.
Here are a few things to keep in mind:
Use the Urban Institute ggplot2 theme. On Windows, load it with source('https://raw.githubusercontent.com/UrbanInstitute/urban_R_theme/master/urban_theme_windows.R'), and on Mac with source('https://raw.githubusercontent.com/UrbanInstitute/urban_R_theme/master/urban_theme_mac.R').

No guide can answer every question. A lot of thought goes into making a good map. These blog posts are good sources of inspiration for answering those deeper questions:
Choropleth maps display geographic areas with shades, colors, or patterns in proportion to a variable or variables. Choropleth maps can represent geographies as large as the entire world and as small as Census tracts.
library(fiftystater) is the best R package for mapping the entire United States divided into states because it includes Hawaii and Alaska. Be sure to include coord_map("albers", lat0 = 39, lat1 = 45) to create an Albers Equal Area projection.
# load necessary packages
library(tidyverse)
library(forcats)
library(gridExtra)
library(fiftystater)
# read the state CHIP data
chip <- read_csv("mapping/data/chip-enrollment.csv")
# create data frame with vector boundaries from library(fiftystater)
data("fifty_states")
# set the state names to lower case and create the five groups based on the order
# of state CHIP enrollment
chip <- chip %>%
mutate(state = tolower(State)) %>%
arrange(`CHIP Enrollment`) %>%
mutate(enrollment_group = c(rep("Group 1", times = 11),
rep("Group 2", times = 10),
rep("Group 3", times = 10),
rep("Group 4", times = 10),
rep("Group 5", times = 10)))
# create a vector with 5 hexadecimal colors for the 5 groups
urban_colors <- c("#cfe8f3", "#a2d4ec", "#46abdb", "#12719e", "#062635")
# plot the data with colors based on state CHIP enrollment and white borders
# between the states
ggplot(data = chip, mapping = aes(map_id = state)) +
geom_map(mapping = aes(fill = enrollment_group),
map = fifty_states,
color = "#ffffff",
size = 0.25) +
expand_limits(x = fifty_states$long, y = fifty_states$lat) +
scale_fill_manual(values = urban_colors) +
coord_map(projection = "albers", lat0 = 39, lat1 = 45) +
labs(title = "State CHIP Enrollment") +
theme(axis.text = element_blank(),
axis.ticks = element_blank(),
axis.title = element_blank(),
panel.grid = element_blank(),
axis.line = element_blank(),
plot.margin = margin(t = 0, r = 0, b = 0, l = 0))
library(fiftystater) is a better choice than library(mapdata) for mapping the entire United States because it includes Alaska and Hawaii. But library(mapdata) is better for mapping individual states or regions because the data are easier to filter. The key is to use map_data() to read in geographies and geom_polygon() to map the geographies.
Variables mapped as colors or shades need to be merged with left_join() onto the geographic data loaded by map_data(). In this example, the variable region in state contains lowercase state names while the variable State in chip contains capitalized state names. mutate(state = tolower(State)) resolves this mismatch before the merge.
If ever in doubt, use anti_join() to test which cases don’t merge.
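As a small self-contained sketch (with made-up state names, not the CHIP data), anti_join() returns the rows of x that have no match in y, so any row it returns is a case that would silently pick up NAs in a left_join():

```r
library(dplyr)

# two tiny lookup tables with made-up values: region is lowercase, state is not
map_names <- tibble(region = c("virginia", "maryland"))
chip_names <- tibble(state = c("Virginia", "maryland"))

# anti_join() returns the rows of x with no match in y -- here "virginia"
# fails to match because of the capitalized "Virginia"
anti_join(x = map_names, y = chip_names, by = c("region" = "state"))

# after lowercasing, anti_join() returns zero rows and left_join() is safe
chip_names <- mutate(chip_names, state = tolower(state))
nrow(anti_join(x = map_names, y = chip_names, by = c("region" = "state")))  # 0
```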
# load the necessary packages
library(tidyverse)
library(mapdata)
# read the shapefiles for the Continental US
state <- map_data("state")
# read in the CHIP data
chip <- read_csv("mapping/data/chip-enrollment.csv")
# tweak the theme for the map
theme_map <- theme(
axis.text = element_blank(),
axis.ticks = element_blank(),
axis.title = element_blank(),
panel.grid = element_blank(),
axis.line = element_blank()
)
# set the state names to lower case and create the five groups
chip <- chip %>%
mutate(state = tolower(State)) %>%
arrange(`CHIP Enrollment`) %>%
mutate(enrollment_group = c(rep("Group 1", times = 11),
rep("Group 2", times = 10),
rep("Group 3", times = 10),
rep("Group 4", times = 10),
rep("Group 5", times = 10)))
# merge the state shapefiles with the state CHIP data
state <- left_join(x = state, y = chip, by = c("region" = "state"))
# test merge criteria for mismatches
# hopefully nrow(mismatches) == 0
mismatches <- anti_join(x = state, y = chip, by = c("region" = "state"))
# create a vector with 5 hexadecimal colors for the 5 groups
urban_colors <- c("#cfe8f3", "#a2d4ec", "#46abdb", "#12719e", "#062635")
# map!
ggplot(data = state, mapping = aes(x = long, y = lat, group = group, fill = enrollment_group)) +
coord_map(projection = "albers", lat0 = 39, lat1 = 45) +
geom_polygon(color = "#ffffff", size = 0.25) +
scale_fill_manual(values = urban_colors) +
theme_map
library(mapdata) makes it easy to filter to a subset of geographies. Here, filter(region %in% c("washington", "oregon", "idaho")) is used to filter to three states before ggplot() is called.
# load the necessary packages
library(tidyverse)
library(mapdata)
# read the shapefiles for the three states
state <- map_data("state") %>%
filter(region %in% c("washington", "oregon", "idaho"))
# create a theme for the maps
theme_map <- theme(
axis.text = element_blank(),
axis.ticks = element_blank(),
axis.title = element_blank(),
panel.grid = element_blank(),
axis.line = element_blank()
)
# map!
ggplot(data = state, mapping = aes(x = long, y = lat, group = group)) +
coord_map(projection = "albers", lat0 = 39, lat1 = 45) +
geom_polygon(color = "#ffffff", fill = "gray", size = 0.25) +
labs(title = "Northwest United States") +
theme(plot.caption = element_text(size = 12)) +
theme_map
map_data() can be used to select a number of different geographies including the USA, states, counties, France, Italy, New Zealand, the world, and world2. Here, map_data("county") is used to map all counties in the Continental United States.
# load the necessary packages
library(tidyverse)
library(mapdata)
# read shapefiles for counties
county <- map_data("county")
# create a theme for the maps
theme_map <- theme(
axis.text = element_blank(),
axis.ticks = element_blank(),
axis.title = element_blank(),
panel.grid = element_blank(),
axis.line = element_blank()
)
# map!
ggplot(data = county, mapping = aes(x = long, y = lat, group = group)) +
coord_map(projection = "albers", lat0 = 39, lat1 = 45) +
geom_polygon(color = "#ffffff", fill = "gray", size = 0.1) +
theme_map
Geographies can be added on top of each other as layers. In this case, geom_polygon() is called once with no fill and a thick border to outline states and then a second time with fill and a thin border to outline counties.
# load the necessary packages
library(tidyverse)
library(mapdata)
# read the shapefiles for counties
county <- map_data("county") %>%
filter(region %in% c("washington", "oregon", "idaho"))
# read the shapefiles for states
state <- map_data("state") %>%
filter(region %in% c("washington", "oregon", "idaho"))
# create a theme
theme_map <- theme(
axis.text = element_blank(),
axis.ticks = element_blank(),
axis.title = element_blank(),
panel.grid = element_blank(),
axis.line = element_blank()
)
# map
ggplot(data = county, mapping = aes(x = long, y = lat, group = group)) +
geom_polygon(color = "#ffffff", fill = "gray", size = 0.2) +
geom_polygon(data = state,
mapping = aes(x = long, y = lat, group = group),
color = "#ffffff",
fill = NA,
size = 0.4) +
coord_map(projection = "albers", lat0 = 39, lat1 = 45) +
labs(title = "Counties in the Northwest United States") +
theme_map
The groups of states above use the same Albers Equal-Area Conic Projection as the maps of the entire United States. For single state maps, the Urban Institute uses state plane coordinate systems. This requires loading the data from library(mapdata) and converting the coordinates to the correct state plane coordinate systems.
Use state_proj from library(USAboundaries) to explore EPSG coordinate reference system definitions, and use projection <- state_plane("VA", plane_id = NULL, type = "epsg") to retrieve the correct EPSG coordinate reference system definition to pass to spTransform().
# load necessary libraries
library(tidyverse)
library(mapdata)
library(rgdal)
library(sp)
library(USAboundaries)
# load data and filter it to just Virginia
state <- map_data("state") %>%
filter(region %in% c("virginia"))
# this data frame from library(USAboundaries) lists 124 coordinate systems in
# the United States and is a good reference
#state_proj
# pick the state plane coordinate system for Virginia
projection <- state_plane("VA", plane_id = NULL, type = "epsg")
# turn long and lat into spatial coordinates
coordinates(state) <- ~ long + lat
# declare that the coordinates are currently in WGS 84
# http://spatialreference.org/ref/epsg/wgs-84/
proj4string(state) <- CRS("+init=epsg:4326")
# convert coordinates to the state plane coordinate system for Virginia
state <- spTransform(state, CRS(paste0("+init=epsg:", projection)))
# convert spatial points data frame into a tibble
state <- as_tibble(state)
# create a theme
theme_map <- theme(
axis.text = element_blank(),
axis.ticks = element_blank(),
axis.title = element_blank(),
panel.grid = element_blank(),
axis.line = element_blank()
)
# map
ggplot(data = state, mapping = aes(x = long, y = lat, group = group)) +
geom_polygon(color = "#ffffff", fill = "gray", size = 0.2) +
labs(title = "Virginia") +
coord_fixed() +
theme_map
Leaflet is a JavaScript library for building interactive maps. The syntax is different from ggplot2, but library(leaflet) makes it easy to build interactive maps in R.
This post by Julia Silge is an excellent introduction to library(leaflet) and library(tidycensus).
# load necessary packages
library(tidyverse)
library(sf)
library(stringr)
library(rgdal)
library(leaflet)
library(viridis)
library(tidycensus)
# read mortgage data and drop unnecessary variables
hmda <- read_csv("mapping/data/hmdamap2015_10_purch.csv",
col_types = cols(
tract = col_character(),
year1 = col_integer(),
tract_T = col_character(),
aa = col_integer(),
his = col_integer(),
nhw = col_integer(),
asian = col_integer(),
total = col_integer())) %>%
select(-year1, -tract_T)
# The Census tract IDs in the hmda data do not perfectly match the Census tract
# IDs in the shapefiles. Add a leading zero to the 10-digit Census tract IDs
# so all IDs are 11 characters and match before using left_join()
hmda <- hmda %>%
mutate(tract = ifelse(str_length(tract) != 11, str_c("0", tract), tract))
# Pull data from the ACS using library(tidycensus). Rename GEOID to tract so the
# tract ID has the same variable name as on the mortgage data.
ct_value <- get_acs(geography = "tract",
variables = "B25077_001",
state = "CT",
geometry = TRUE) %>%
rename(tract = GEOID)
# join housing data to the shapefiles
ct_value <- left_join(x = ct_value, y = hmda, by = "tract")
# check for mismatches using anti_join. Hopefully nrow(mismatches) == 0
mismatches <- anti_join(x = ct_value, y = hmda, by = "tract")
# create a color palette based on the total number of mortgages variable
pal <- colorNumeric(palette = c("#cfe8f3", "#062635"),
domain = ct_value$total)
# create the map
connecticut <- ct_value %>%
st_transform(crs = "+init=epsg:4326") %>%
leaflet(width = "100%") %>%
addProviderTiles(provider = "CartoDB.Positron") %>%
addPolygons(popup = ~ str_extract(NAME, "^([^,]*)"),
stroke = FALSE,
smoothFactor = 0,
fillOpacity = 1,
color = ~ pal(total)) %>%
addLegend(position = "bottomright",
pal = pal,
values = ~ total,
title = "Mortgage Purchases",
opacity = 1)
library(htmltools)
browsable(
tagList(list(
tags$head(
# you'll need to be very specific
tags$style("* {font-family:Lato !important;}")
# could also use url
#tags$link(href="https://maxcdn.bootstrapcdn.com/font-awesome/4.5.0/css/font-awesome.min.css",rel="stylesheet")
),
connecticut
))
)
# https://stackoverflow.com/questions/35720698/is-it-possible-to-include-custom-css-in-htmlwidgets-for-r-and-or-leafletr
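The zero-padding step above can also be written with str_pad() from library(stringr), which pads to a fixed width regardless of how many characters are missing. A sketch with made-up tract IDs:

```r
library(stringr)

# pad IDs on the left with zeros until they are 11 characters wide;
# IDs that are already 11 characters are left unchanged
tracts <- c("9800100100", "11001001002")
str_pad(tracts, width = 11, side = "left", pad = "0")
# returns "09800100100" "11001001002"
```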
It is often desirable to map geographies that aren’t easily available in an R package. Like any GIS, R can work with shapefiles, which are a vector data storage format created by Esri to store location, shape, and attributes of geographic features. Shapefiles are available for most geographies.
readOGR() from library(rgdal) is the most commonly used function for reading shapefiles. One challenge of working with shapefiles is the hierarchical data structure. Click on the shapefile in the RStudio environment to explore the tree and see how the data are organized. Another challenge is merges. In the following example, mortgage data need to be merged with the appropriate Census tracts.
# load necessary packages
library(tidyverse)
library(sp)
library(stringr)
library(rgdal)
library(broom)
# read in mortgage data
hmda <- read_csv("mapping/data/hmdamap2015_10_purch.csv",
col_types = cols(
tract = col_character(),
year1 = col_integer(),
tract_T = col_character(),
aa = col_integer(),
his = col_integer(),
nhw = col_integer(),
asian = col_integer(),
total = col_integer())) %>%
select(-year1, -tract_T)
# The Census tract IDs in the hmda data do not perfectly match the Census tract
# IDs in the shapefiles. Add a leading zero to the 10-digit Census tract IDs
# so all IDs are 11 characters and match before using left_join()
hmda <- hmda %>%
mutate(tract = ifelse(str_length(tract) != 11, str_c("0", tract), tract))
# read in shapefile for Washington D.C.
tracts_dc <- readOGR(dsn = "mapping/shapefiles/tl_2015_11_tract", verbose = FALSE)
# create an extra data frame with the tract number and the position from 1 to n
tract_id <- tibble(tract = as.character(tracts_dc@data$GEOID),
position = 1:length(tracts_dc@data$GEOID))
# convert shapefile to a tibble. Add a vector for position that will be used to
# merge on the Census tract IDs
tracts_dc <- tidy(tracts_dc) %>%
rename(position = id) %>%
mutate(position = as.numeric(position)) %>%
mutate(position = position + 1)
# merge on tract number
tracts_dc <- left_join(x = tracts_dc, y = tract_id, by = "position")
# check for mismatches using anti_join. Hopefully nrow(mismatches) == 0
mismatches1 <- anti_join(x = tracts_dc, y = tract_id, by = "position")
# With the new tract number, merge on the mortgage data
tracts_dc <- left_join(x = tracts_dc, y = hmda, by = "tract")
# check for mismatches using anti_join. Hopefully nrow(mismatches) == 0
mismatches2 <- anti_join(x = tracts_dc, y = hmda, by = "tract")
# munge the data and transform it into a long data frame
tracts_dc_long <- tracts_dc %>%
gather(key = "race", value = "Mortgages",
-long, -lat, -order, -hole, -piece,
-group, -position, -tract, -total) %>%
mutate(race = factor(race,
levels = c("aa", "asian", "his", "nhw"),
labels = c("African American", "Asian", "Hispanic", "Non-Hispanic White"))) %>%
mutate(Mortgages = ifelse(Mortgages == 0, NA, Mortgages)) %>%
mutate(Mortgages = Mortgages * 10)
# plot!
tracts_dc_long %>%
ggplot(mapping = aes(x = long, y = lat, group = group, fill = Mortgages)) +
geom_polygon(size = 0.3) +
coord_map() +
facet_wrap(facets = ~race, nrow = 1) +
scale_fill_continuous(low = "#CEE8F3",
high = "#094C6B",
breaks = c(0, 20, 40, 60, 80)) +
labs(x = NULL,
y = NULL) +
labs(title = "Mortgage Originations",
subtitle = "Home purchase loans in 2015 by race and ethnicity",
caption = "Urban Institute analysis of HMDA data") +
theme(axis.text = element_blank(),
axis.line = element_blank(),
axis.ticks = element_blank(),
panel.grid.major = element_blank(),
legend.position = "right",
legend.direction = "vertical",
legend.title = element_text(face = "bold", size = 11),
strip.text = element_text(size = 11),
plot.caption = element_text(size = 11),
strip.background = element_rect(fill = "#ffffff"))
Choropleths tend to overemphasize areas with low population density because size represents area instead of the statistic of interest. This is especially true for maps of the United States because many western states are large and sparsely populated. Tile grid maps, which give equal size to each state, mitigate this problem.
This example fills geom_tile() from library(ggplot2) with five discrete colors based on state CHIP enrollment and then uses facet_geo() from library(geofacet) to create a tile grid map.
# load the necessary packages
library(tidyverse)
library(forcats)
library(geofacet)
# read CHIP enrollment data
chip <- read_csv("mapping/data/chip-enrollment.csv",
col_types = cols(
State = col_character(),
`CHIP Enrollment` = col_integer(),
state_abbreviation = col_character()))
# Add xdimension = 1 and ydimension = 1 for the dimensions of geom_tile(). The
# dimensions can be any arbitrary positive numbers as long as
# xdimension == ydimension. Arrange the observations by state CHIP enrollment
# and add groups for the colors of the tiles
chip <- chip %>%
mutate(xdimension = 1,
ydimension = 1) %>%
arrange(desc(`CHIP Enrollment`)) %>%
mutate(enrollment_group = c(rep("Group 1", times = 11),
rep("Group 2", times = 10),
rep("Group 3", times = 10),
rep("Group 4", times = 10),
rep("Group 5", times = 10)))
# create a vector with 5 hexadecimal colors for the 5 groups
urban_colors <- c("#062635", "#12719e", "#46abdb", "#a2d4ec", "#cfe8f3")
# create a custom geofacet grid
urban_grid <- tibble(
row = c(1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4,
4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7,
7, 7, 8, 8, 8),
col = c(1, 11, 6, 10, 11, 1, 2, 3, 4, 5, 6, 7, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 2, 3, 4, 5, 6, 7, 8, 9, 4, 5, 6, 7, 8, 1, 4, 9),
code = c("AK", "ME", "WI", "VT", "NH", "WA", "ID", "MT", "ND", "MN", "IL", "MI", "NY", "MA", "OR", "NV", "WY", "SD", "IA", "IN", "OH", "PA", "NJ", "CT", "RI", "CA", "UT", "CO", "NE", "MO", "KY", "WV", "VA", "MD", "DE", "AZ", "NM", "KS", "AR", "TN", "NC", "SC", "DC", "OK", "LA", "MS", "AL", "GA", "HI", "TX", "FL"),
name = c("Alaska", "Maine", "Wisconsin", "Vermont", "New Hampshire", "Washington", "Idaho", "Montana", "North Dakota", "Minnesota", "Illinois", "Michigan", "New York", "Massachusetts", "Oregon", "Nevada", "Wyoming", "South Dakota", "Iowa", "Indiana", "Ohio", "Pennsylvania", "New Jersey", "Connecticut", "Rhode Island", "California", "Utah", "Colorado", "Nebraska", "Missouri", "Kentucky", "West Virginia", "Virginia", "Maryland", "Delaware", "Arizona", "New Mexico", "Kansas", "Arkansas", "Tennessee", "North Carolina", "South Carolina", "District of Columbia", "Oklahoma", "Louisiana", "Mississippi", "Alabama", "Georgia", "Hawaii", "Texas", "Florida")
)
# create tile grid map
chip %>%
ggplot(aes(x = xdimension, y = ydimension, fill = enrollment_group)) +
geom_tile() +
scale_fill_manual(values = urban_colors) +
geom_text(aes(label = state_abbreviation), color = "white") +
facet_geo(facets = ~state_abbreviation, grid = urban_grid) +
labs(title = "State CHIP Enrollment",
x = NULL,
y = NULL) +
theme(axis.text = element_blank(),
axis.ticks = element_blank(),
axis.line = element_blank(),
strip.text = element_blank(),
plot.background = element_rect(color = "white"),
panel.background = element_blank(),
panel.grid = element_blank(),
panel.spacing = unit(0L, "pt"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank())
library(geofacet) adds geofaceting functionality to library(ggplot2). Geofaceting arranges sub-geography-specific plots into a grid that resembles a larger geography.
Interactive geofacets of the United States have been used by the Urban Institute in the features section. For example, “A Matter of Time” included geofaceted line charts showing trends in incarceration by state. Static geofacets of the United States were included in “Barriers to Accessing Homeownership: Down Payment, Credit, and Affordability” by the Housing Finance Policy Center.
library(geofacet) comes with two default United States grids, but the Urban Institute uses a custom layout for United States geofacets. The code for that layout is included in each of the following examples. Creating custom grids is simple and is outlined in the vignette for library(geofacet). This grid designer is useful for testing custom layouts. Custom geofacet grids can be made for any sized geography and any set of geographies.
Lots of custom formatting in theme() is necessary because geofaceted plots are different from the plots for which the theme was optimized. Copying and pasting the code in theme() is a good start.
The first example is a variation on the tile grid map. Here, state labels are moved to the strip and values are added in each square as a text geom.
# load the necessary packages
library(tidyverse)
library(geofacet)
library(fivethirtyeight)
# create a tibble with state name abbreviations
state_abbreviations <- tibble(state = c(state.name, "District of Columbia"),
abbreviation = c(state.abb, "DC"))
# read the bad_drivers data from library(fivethirtyeight) and add x- and
# y-dimensions for each geom_tile(). The dimensions can be any positive numbers
# as long as xdimension == ydimension
usa_drivers <- bad_drivers %>%
mutate(xdimension = 1,
ydimension = 1)
# merge state abbreviations on to bad_drivers based on state
usa_drivers <- left_join(x = usa_drivers, y = state_abbreviations, by = "state")
# test merge criteria for mismatches. hopefully nrow(mismatches) == 0
mismatches <- anti_join(x = usa_drivers, y = state_abbreviations, by = "state")
# create a custom geofacet grid
urban_grid <- tibble(
row = c(1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4,
4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7,
7, 7, 8, 8, 8),
col = c(1, 11, 6, 10, 11, 1, 2, 3, 4, 5, 6, 7, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 2, 3, 4, 5, 6, 7, 8, 9, 4, 5, 6, 7, 8, 1, 4, 9),
code = c("AK", "ME", "WI", "VT", "NH", "WA", "ID", "MT", "ND", "MN", "IL", "MI", "NY", "MA", "OR", "NV", "WY", "SD", "IA", "IN", "OH", "PA", "NJ", "CT", "RI", "CA", "UT", "CO", "NE", "MO", "KY", "WV", "VA", "MD", "DE", "AZ", "NM", "KS", "AR", "TN", "NC", "SC", "DC", "OK", "LA", "MS", "AL", "GA", "HI", "TX", "FL"),
name = c("Alaska", "Maine", "Wisconsin", "Vermont", "New Hampshire", "Washington", "Idaho", "Montana", "North Dakota", "Minnesota", "Illinois", "Michigan", "New York", "Massachusetts", "Oregon", "Nevada", "Wyoming", "South Dakota", "Iowa", "Indiana", "Ohio", "Pennsylvania", "New Jersey", "Connecticut", "Rhode Island", "California", "Utah", "Colorado", "Nebraska", "Missouri", "Kentucky", "West Virginia", "Virginia", "Maryland", "Delaware", "Arizona", "New Mexico", "Kansas", "Arkansas", "Tennessee", "North Carolina", "South Carolina", "District of Columbia", "Oklahoma", "Louisiana", "Mississippi", "Alabama", "Georgia", "Hawaii", "Texas", "Florida")
)
# create tile grid map
usa_drivers %>%
ggplot(aes(x = xdimension, y = ydimension, fill = perc_alcohol)) +
geom_tile() +
geom_text(aes(label = paste0(as.character(perc_alcohol), "%")),
color = "white",
family = "Lato") +
scale_fill_gradientn(colors = c("#cfe8f3", "#a2d4ec", "#46abdb", "#12719e", "#062635")) +
facet_geo(facets = ~abbreviation, grid = urban_grid) +
labs(title = "Don't Drink and Drive",
subtitle = "% of Drivers in Deadly Wrecks Who Were Alcohol-Impaired",
caption = "National Highway Traffic Administration \n Data from library(fivethirtyeight)",
x = NULL,
y = NULL) +
theme(plot.background = element_rect(colour = "white"),
panel.grid = element_blank(),
panel.grid.major = element_blank(),
axis.text = element_blank(),
axis.ticks = element_blank(),
axis.line = element_blank(),
panel.spacing = unit(0L, "pt"),
legend.position = "none",
strip.text.x = element_text(size = 9L))
Unlike the tile grid map, geofaceting isn’t limited to just text and color in the facets. This geofaceted map uses bar plots. Be careful: plots with many elements per facet can quickly become cluttered.
# load necessary packages
library(tidyverse)
library(fivethirtyeight)
library(geofacet)
# create data frame with state driving data
usa_drivers <- bad_drivers %>%
mutate(perc_distracted = 100 - perc_not_distracted) %>%
select(state, perc_speeding, perc_alcohol, perc_distracted) %>%
gather(key = issue, value = value, - state) %>%
mutate(issue = ifelse(issue == "perc_speeding", "Speeding", issue),
issue = ifelse(issue == "perc_alcohol", "Alcohol-Impaired", issue),
issue = ifelse(issue == "perc_distracted", "Distracted", issue))
# create a tibble with state name abbreviations
state_abbreviations <- tibble(state = c(state.name, "District of Columbia"),
abbreviation = c(state.abb, "DC"))
# merge state abbreviations on to bad_drivers based on state
usa_drivers <- left_join(x = usa_drivers, y = state_abbreviations, by = "state")
# test merge criteria for mismatches. hopefully nrow(mismatches) == 0
mismatches <- anti_join(x = usa_drivers, y = state_abbreviations, by = "state")
# create a custom geofacet grid
urban_grid <- tibble(
row = c(1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4,
4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7,
7, 7, 8, 8, 8),
col = c(1, 11, 6, 10, 11, 1, 2, 3, 4, 5, 6, 7, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 2, 3, 4, 5, 6, 7, 8, 9, 4, 5, 6, 7, 8, 1, 4, 9),
code = c("AK", "ME", "WI", "VT", "NH", "WA", "ID", "MT", "ND", "MN", "IL", "MI", "NY", "MA", "OR", "NV", "WY", "SD", "IA", "IN", "OH", "PA", "NJ", "CT", "RI", "CA", "UT", "CO", "NE", "MO", "KY", "WV", "VA", "MD", "DE", "AZ", "NM", "KS", "AR", "TN", "NC", "SC", "DC", "OK", "LA", "MS", "AL", "GA", "HI", "TX", "FL"),
name = c("Alaska", "Maine", "Wisconsin", "Vermont", "New Hampshire", "Washington", "Idaho", "Montana", "North Dakota", "Minnesota", "Illinois", "Michigan", "New York", "Massachusetts", "Oregon", "Nevada", "Wyoming", "South Dakota", "Iowa", "Indiana", "Ohio", "Pennsylvania", "New Jersey", "Connecticut", "Rhode Island", "California", "Utah", "Colorado", "Nebraska", "Missouri", "Kentucky", "West Virginia", "Virginia", "Maryland", "Delaware", "Arizona", "New Mexico", "Kansas", "Arkansas", "Tennessee", "North Carolina", "South Carolina", "District of Columbia", "Oklahoma", "Louisiana", "Mississippi", "Alabama", "Georgia", "Hawaii", "Texas", "Florida")
)
# plot!
ggplot(data = usa_drivers, aes(x = issue, y = value, fill = issue)) +
geom_col() +
coord_flip() +
scale_y_continuous(expand = c(0, 0),
limits = c(0, 100),
breaks = c(0, 50, 100),
labels = c("0", ".5", "1")) +
facet_geo(facets = ~abbreviation, grid = urban_grid) +
labs(title = "Driving is dangerous",
subtitle = "Proportion of drivers in fatal collisions who were...",
x = NULL,
y = NULL,
caption = "National Highway Traffic Administration \n Data from library(fivethirtyeight)") +
theme(plot.background = element_rect(colour = "white"),
panel.grid = element_blank(),
panel.grid.major = element_blank(),
axis.text.x = element_text(margin = margin(t = 2)),
axis.text.y = element_blank(),
axis.text = element_text(size = 8L),
axis.line = element_blank(),
panel.border = element_rect(colour = "black",
fill = NA,
size = 0.3),
strip.background = element_rect(fill = "grey85",
colour = "black",
size = 0.3),
axis.ticks.length = unit(1L, "pt"),
strip.text.x = element_text(margin = margin(t = 1, b = 1), size = 11))
# load necessary packages
library(tidyverse)
library(geofacet)
library(fivethirtyeight)
library(purrr)
# add random noise to create simulated longitudinal data
usa_drivers <- bad_drivers %>%
mutate(`2010` = (100 - perc_not_distracted) / 100)
usa_drivers$`2011` <- usa_drivers$`2010` * rnorm(51, mean = 1.05, sd = 0.05)
usa_drivers$`2012` <- usa_drivers$`2011` * rnorm(51, mean = 1.05, sd = 0.05)
usa_drivers$`2013` <- usa_drivers$`2012` * rnorm(51, mean = 1.05, sd = 0.05)
usa_drivers$`2014` <- usa_drivers$`2013` * rnorm(51, mean = 1.05, sd = 0.05)
usa_drivers$`2015` <- usa_drivers$`2014` * rnorm(51, mean = 1.05, sd = 0.05)
# gather data from wide-format to long-format
usa_drivers <- usa_drivers %>%
gather(`2010`:`2015`, key = "Year", value = "Share Distracted")
# create a tibble with state abbreviations
state_abbreviations <- tibble(state = c(state.name, "District of Columbia"),
abbreviation = c(state.abb, "DC"))
# merge state abbreviations on to bad_drivers based on state
usa_drivers <- left_join(x = usa_drivers, y = state_abbreviations, by = "state")
# test merge criteria for mismatches. hopefully nrow(mismatches) == 0
mismatches <- anti_join(x = usa_drivers, y = state_abbreviations, by = "state")
# create a custom geofacet grid
urban_grid <- tibble(
row = c(1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4,
4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7,
7, 7, 8, 8, 8),
col = c(1, 11, 6, 10, 11, 1, 2, 3, 4, 5, 6, 7, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 2, 3, 4, 5, 6, 7, 8, 9, 4, 5, 6, 7, 8, 1, 4, 9),
code = c("AK", "ME", "WI", "VT", "NH", "WA", "ID", "MT", "ND", "MN", "IL", "MI", "NY", "MA", "OR", "NV", "WY", "SD", "IA", "IN", "OH", "PA", "NJ", "CT", "RI", "CA", "UT", "CO", "NE", "MO", "KY", "WV", "VA", "MD", "DE", "AZ", "NM", "KS", "AR", "TN", "NC", "SC", "DC", "OK", "LA", "MS", "AL", "GA", "HI", "TX", "FL"),
name = c("Alaska", "Maine", "Wisconsin", "Vermont", "New Hampshire", "Washington", "Idaho", "Montana", "North Dakota", "Minnesota", "Illinois", "Michigan", "New York", "Massachusetts", "Oregon", "Nevada", "Wyoming", "South Dakota", "Iowa", "Indiana", "Ohio", "Pennsylvania", "New Jersey", "Connecticut", "Rhode Island", "California", "Utah", "Colorado", "Nebraska", "Missouri", "Kentucky", "West Virginia", "Virginia", "Maryland", "Delaware", "Arizona", "New Mexico", "Kansas", "Arkansas", "Tennessee", "North Carolina", "South Carolina", "District of Columbia", "Oklahoma", "Louisiana", "Mississippi", "Alabama", "Georgia", "Hawaii", "Texas", "Florida")
)
# plot!
usa_drivers %>%
ggplot(mapping = aes(x = Year, y = `Share Distracted`, group = abbreviation)) +
geom_line() +
scale_x_discrete(breaks = c(2010, 2015), labels = c("'10", "'15")) +
scale_y_continuous(expand = c(0, 0), limits = c(0, 1), breaks = c(0, 0.5, 1)) +
facet_geo(facets = ~abbreviation, grid = urban_grid) +
labs(title = "Driving is dangerous",
subtitle = "Simulated Distracted Driving is Increasing",
x = NULL,
y = NULL,
caption = "National Highway Traffic Safety Administration \n Data from library(fivethirtyeight) + Random Noise") +
theme(plot.background = element_rect(colour = "white"),
axis.text.x = element_text(margin = margin(t = 2)),
axis.text = element_text(size = 8L),
axis.line = element_blank(),
panel.border = element_rect(colour = "black",
fill = NA,
size = 0.3),
strip.background = element_rect(fill = "grey85",
colour = "black",
size = 0.3),
axis.ticks.length = unit(1L, "pt"),
strip.text.x = element_text(margin = margin(t = 1, b = 1), size = 11))
Dot maps represent observations or groups of observations at longitudes and latitudes. Most dot maps add dots on top of a recognizable geography such as the Continental United States. It is common to add color or size as aesthetics representing an additional variable like race and ethnicity or income.
This motivating example loads a map of the United States, adds geom_point() for each capital as a layer on top of the map, and then labels the capitals with geom_text_repel().
# load the necessary packages
library(tidyverse)
library(fiftystater)
library(ggrepel)
# load state shapefiles
state <- map_data("state")
# drop Alaska and Hawaii
# this data frame has the latitude and longitude for each capital
state_capitals <- read_csv("mapping/data/state-capitals.csv") %>%
filter(!state %in% c("Alaska", "Hawaii"))
# add theme for maps
theme_map <- theme(
axis.text = element_blank(),
axis.ticks = element_blank(),
axis.title = element_blank(),
panel.grid = element_blank(),
axis.line = element_blank(),
plot.margin = margin(t = 0, r = 0, b = 0, l = 0)
)
# map!
ggplot() +
geom_polygon(data = state, mapping = aes(x = long, y = lat, group = group),
color = "#ffffff", fill = "#d2d2d2", size = 0.2) +
geom_point(data = state_capitals, aes(longitude, latitude), alpha = 0.5) +
geom_text_repel(data = state_capitals, aes(longitude, latitude, label = city),
size = 3, color = "#000000", family = "Lato") +
coord_map(projection = "albers", lat0 = 39, lat1 = 45) +
labs(title = "State Capitals",
x = NULL,
y = NULL,
caption = "Urban Institute") +
theme_map
The code above is the quickest and easiest method for making a dot map in R. Dot maps in R typically don’t include Alaska, Hawaii, or territories like Puerto Rico. For this reason, library(fiftystater), which insets Alaska and Hawaii, is often the better choice for dot maps of the entire United States. (There is potential for this to change with subplots.)
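One common variation on the map above is cropping it to a smaller region by limiting the coordinate system. Here is a minimal sketch; the xlim and ylim values are illustrative limits for the mid-Atlantic, not from the original analysis.

```r
# load the necessary packages
library(tidyverse)

# load state shapefiles
state <- map_data("state")

# map, cropped to an illustrative bounding box
ggplot() +
  geom_polygon(data = state, mapping = aes(x = long, y = lat, group = group),
               color = "#ffffff", fill = "#d2d2d2", size = 0.2) +
  # xlim and ylim crop the projected map to the mid-Atlantic
  coord_map(projection = "albers", lat0 = 39, lat1 = 45,
            xlim = c(-80, -74), ylim = c(36, 41))
```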
Subsets of the above map can be made by adjusting the bounding box. Another option is qmplot(). This function will automatically fit the bounding box to the data. Use I() with color = and alpha = so those variables don’t show up in the legends. Here’s a map of mortgages in Washington, D.C.:
qmplot() has many different map types, which can be set with the maptype = argument. "toner-lite" is the only map type that isn’t cluttered.
# load the necessary packages
library(tidyverse)
library(ggmap)
# read the mortgage data for DC in 2011
mortgages <- read_csv("mapping/data/districtofcolumbia2011.csv")
# quick plot
qmplot(x = longitude, y = latitude, data = mortgages, maptype = "toner-lite",
color = I("#1696d2"), alpha = I(0.2)) +
labs(title = "Mortgage purchases in 2011",
subtitle = "Washington, D.C.",
caption = "Urban Institute analysis of HMDA data") +
theme(axis.line = element_blank())
Here’s a similar map of mortgages in Washington, D.C. with geom = "density2d".
# load the necessary packages
library(tidyverse)
library(ggmap)
# load mortgage data for DC in 2011
mortgages <- read_csv("mapping/data/districtofcolumbia2011.csv")
# quick plot!
qmplot(x = longitude, y = latitude, data = mortgages, maptype = "toner-lite",
geom = "density2d", color = I("#1696d2")) +
labs(title = "Mortgage purchases in 2011",
subtitle = "Washington, D.C.",
caption = "Urban Institute analysis of HMDA data") +
theme(axis.line = element_blank())
get_map() can also be used to pull map images from Google Maps. maptype = controls the map type and can be set to “terrain”, “satellite”, and “roadmap”. geom_point() and geom_polygon() can be used to add layers on top of the map. Instead of a bounding box, Google Maps uses the argument zoom =. Zoom is an integer that can range from 3 (continent) to 21 (building). There is no reliable way to know in advance what works best, so try different combinations.
Note: The Google brand can’t be modified according to Google’s terms of service. Because of strict branding guidelines, Google Maps can only be used for internal products. They are still useful as a way to see data in the context of recognizable roads, landmarks, and geographies.
# load the necessary packages
library(tidyverse)
library(sp)
library(rgdal)
library(broom)
library(ggmap)
# load mortgage data for DC in 2011
mortgages <- read_csv("mapping/data/districtofcolumbia2011.csv")
# load and tidy the Washington D.C. shapefile
shapes <- readOGR(dsn = "mapping/shapefiles/tl_2015_11_tract") %>%
tidy()
## OGR data source with driver: ESRI Shapefile
## Source: "mapping/shapefiles/tl_2015_11_tract", layer: "tl_2015_11_tract"
## with 179 features
## It has 12 fields
## Integer64 fields read as strings: ALAND AWATER
# map!
ggmap(get_map(location = "Washington, D.C.", zoom = 12, maptype = "terrain")) +
geom_point(data = mortgages, aes(longitude, latitude),
alpha = 0.5, size = 0.8, color = "#ec008b") +
geom_polygon(data = shapes, aes(long, lat, group = group), fill = "#d2d2d2",
color = "black", size = 0.2, alpha = 0.2) +
labs(title = "Mortgage Purchases in 2011",
subtitle = "Washington, D.C.",
x = NULL,
y = NULL,
caption = "Urban Institute Analysis of HMDA Data") +
theme(axis.text = element_blank(),
axis.line = element_blank(),
axis.ticks = element_blank())
library(ggmap) can also be used to create journey maps. This should look familiar to anyone who has used Google maps for navigation.
# load the necessary packages
library(tidyverse)
library(ggmap)
library(kableExtra)
# get trek data
richmond_dc <- ggmap::trek(from = "Richmond, Virginia",
to = "Fredericksburg, Virginia",
structure = "route")
# create table of trek data
#head(richmond_dc) %>%
# kable(digits = 2, caption = "Output from ggmap::trek()", format = "html") %>%
# kableExtra::kable_styling(full_width = FALSE)
# map trek data
qmap("Ashland, Virginia", zoom = 9, override_limit = TRUE) +
geom_path(data = richmond_dc, mapping = aes(x = lon, y = lat),
colour = "#fdbf11",
size = 1.5,
alpha = .5,
lineend = "round") +
labs(title = "Recreating Google Maps in R")
Like dot maps, paths can be layered on top of any ggplot coordinate plane. Here’s a journey map of an imaginary vacation.
# load the necessary packages
library(tidyverse)
library(fiftystater)
library(ggrepel)
# create a tibble with start and end locations for each trip leg
vacation <- tibble(start = c("Richmond, Virginia", "Annapolis, Maryland", "Columbus, Ohio", "Sacramento, California", "Phoenix, Arizona")) %>%
mutate(end = lead(start)) %>%
mutate(end = ifelse(is.na(end), "Richmond, Virginia", end))
# map the coordinates from each start and end location for each trip leg
vacation_route <- map2(.x = vacation$start, .y = vacation$end,
.f = ggmap::trek, structure = "route") %>%
reduce(bind_rows)
# load state shapefiles
state <- map_data("state")
# drop Alaska and Hawaii
# this data frame has the latitude and longitude for each capital
state_capitals <- read_csv("mapping/data/state-capitals.csv") %>%
filter(state %in% c("Maryland", "Ohio", "Indiana", "Iowa", "Nebraska", "Wyoming", "Utah", "California", "Arizona", "Oklahoma", "Arkansas", "Tennessee", "Virginia"))
# add theme for maps
theme_map <- theme(
axis.text = element_blank(),
axis.ticks = element_blank(),
axis.title = element_blank(),
panel.grid = element_blank(),
axis.line = element_blank(),
plot.margin = margin(t = 0, r = 0, b = 0, l = 0)
)
# map!
ggplot() +
geom_polygon(data = state, mapping = aes(x = long, y = lat, group = group),
color = "#ffffff", fill = "#d2d2d2", size = 0.2) +
geom_point(data = state_capitals, aes(longitude, latitude), alpha = 0.5) +
geom_path(data = vacation_route, aes(lon, lat), color = "#fdbf11") +
geom_text_repel(data = state_capitals, aes(longitude, latitude, label = city),
size = 3, color = "#000000", family = "Lato") +
coord_map("albers", lat0 = 39, lat1 = 45) +
labs(title = "Vacation Roadmap",
x = NULL,
y = NULL) +
theme_map
route() is an alternative to trek() that returns similar results. trek() sticks closer to the roads according to the package’s GitHub repository.
# load the necessary packages
library(tidyverse)
library(ggmap)
# map on google maps
richmond_dc <- route(from = "Richmond, Virginia",
to = "Urban Institute",
structure = "route")
head(richmond_dc) %>%
kable(digits = 2, caption = "Output from ggmap::route()")
| m | km | miles | seconds | minutes | hours | leg | lon | lat |
|---|---|---|---|---|---|---|---|---|
| 49 | 0.05 | 0.03 | 15 | 0.25 | 0.00 | 1 | -77.44 | 37.54 |
| 946 | 0.95 | 0.59 | 151 | 2.52 | 0.04 | 2 | -77.44 | 37.54 |
| 33 | 0.03 | 0.02 | 3 | 0.05 | 0.00 | 3 | -77.43 | 37.55 |
| 5807 | 5.81 | 3.61 | 246 | 4.10 | 0.07 | 4 | -77.43 | 37.55 |
| 103597 | 103.60 | 64.38 | 3312 | 55.20 | 0.92 | 5 | -77.47 | 37.58 |
| 14548 | 14.55 | 9.04 | 478 | 7.97 | 0.13 | 6 | -77.41 | 38.46 |
Geocoding is the process of converting addresses into coordinates. Longitudes and latitudes are often necessary for creating dot maps, calculating distances, and linking addresses to geographies like Census tracts.
For a thorough outline of geocoding methods, read the National Neighborhood Indicators Partnership geocoding memo (available on Box) written by Rob Pitingolo. The rest of this section outlines the methods available in R for geocoding addresses.
The Google Maps Geocoding API is a service that provides geocoding and reverse geocoding. Its main advantage is “fuzzy matching”, which means it can often find coordinates for incorrect or imprecise addresses. It is fast, but the free service is limited to 2,500 queries per day. Use geocodeQueryCheck() to check the number of remaining geocoding queries and to create logic for pausing geocoding.
Note, attempts to work around the query limit by splitting the data across multiple machines have so far been unsuccessful.
The Google Maps Geocoding API also has restrictive terms of service. In particular:
10.4 d. No use of Content without a Google map. Unless the Maps APIs Documentation expressly permits you to do so, you will not use the Content in a Maps API Implementation without a corresponding Google map. For example, you may display Street View imagery without a corresponding Google map because the Maps APIs Documentation expressly permits this use.
10.4 e. No use of Content with a non-Google map. You must not use the Content in a Maps API Implementation that contains a non-Google map.
geocode() in library(ggmap) is the easiest way to use the Google Maps Geocoding API. The get_coordinates() function below is built around geocode() and is heavily inspired by shanelyn on R-Bloggers. It takes an address and returns latitude, longitude, the matched address, and the address type from the Google Maps Geocoding API.
geocode() needs addresses or locations to be one string in one column/variable instead of multiple strings across multiple columns/variables like street, city, and zip. Fortunately, R has powerful tools for string and vector manipulation. Most addresses can be combined using mutate() and paste(). The strings chapter in R for Data Science is a good resource for more sophisticated string manipulation.
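As a minimal sketch of that combining step, the following builds the single address string geocode() expects; the column names (street, city, state, zip) and the example address are illustrative.

```r
# load the necessary packages
library(tidyverse)

# hypothetical address components split across columns
addresses <- tibble(
  street = "2100 M Street NW",
  city = "Washington",
  state = "DC",
  zip = "20037"
)

# combine the pieces into one string per observation with mutate() and paste()
addresses <- addresses %>%
  mutate(address = paste(street, city, state, zip, sep = ", "))
```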
# load the necessary packages
library(tidyverse)
library(ggmap)
library(stringr)
# create coordinates function
get_coordinates <- function(address) {
# geocode the address (returns a complicated data structure)
geo_reply <- geocode(location = address, output = 'all', messaging = TRUE, source = "dsk")
# create an empty data frame to populate with tidy data
answer <- data.frame(
latitude = NA,
longitude = NA,
accuracy = NA,
formatted_address = NA,
address_type = NA,
status = NA
)
# store the status of the geo code
answer$status <- geo_reply$status
# store the number of remaining geocodes
remaining_queries <- geocodeQueryCheck()
# if the R session is near the Google geocode query limit, pause for 1 minute
while (remaining_queries <= 5) {
print(paste("Over query limit,", remaining_queries, "queries remaining. Pausing for 1 minute."))
time <- Sys.time()
print(as.character(time))
Sys.sleep(60)
remaining_queries <- geocodeQueryCheck()
}
# if the geocode was bad, return the empty data frame
if (geo_reply$status != "OK") {
return(answer)
}
# else, extract what we need from the Google server reply into the tibble
answer$latitude <- geo_reply$results[[1]]$geometry$location$lat
answer$longitude <- geo_reply$results[[1]]$geometry$location$lng
if (length(geo_reply$results[[1]]$types) > 0){
answer$accuracy <- geo_reply$results[[1]]$types[[1]]
}
answer$address_type <- paste(geo_reply$results[[1]]$types, collapse=',')
answer$formatted_address <- geo_reply$results[[1]]$formatted_address
return(answer)
}
The above function can be used on a single address or it can be iterated across a vector of addresses. Geocoding one address is as simple as calling any function, like mean() or sum() in R.
# load the necessary packages
library(tidyverse)
library(ggmap)
# use the get_coordinates function
get_coordinates(address = "1600 Monument Ave. Richmond, Virginia 23220") %>%
kable(digits = 2, caption = "Output from geocoding one address")
| latitude | longitude | accuracy | formatted_address | address_type | status |
|---|---|---|---|---|---|
| 37.55 | -77.46 | street_address | 1600 Monument Ave. Richmond, Virginia 23220 | street_address | OK |
Geocoding multiple addresses is a little more complicated than geocoding one address. It requires iterating the function across a vector of addresses using map() from library(purrr). As the output from the previous function shows, get_coordinates() returns a data frame of interesting information that then needs to be parsed into vectors using latitude = map_dbl(coords, 1), longitude = map_dbl(coords, 2) and accuracy = map_chr(coords, 3).
Note: map() from library(purrr) often creates conflicts when mapping. Be careful with the order in which packages are loaded and don’t be afraid to use double colons like purrr::map().
# load the necessary packages
library(tidyverse)
library(purrr)
library(ggmap)
# creates vectors of institution names and addresses
institutions <- c("Urban Institute", "Brookings Institution")
addresses <- c("2100 M Street NW, Washington DC",
"1775 Massachusetts Ave NW, Washington, DC 20036")
# iterate the function over the tibble and return the answer
tibble(institution = institutions,
address = addresses) %>%
mutate(coords = purrr::map(address, get_coordinates)) %>%
mutate(latitude = map_dbl(coords, 1),
longitude = map_dbl(coords, 2),
accuracy = map_chr(coords, 3)) %>%
select(-coords) %>%
kable(digits = 2, caption = "Output from geocoding multiple addresses")
| institution | address | latitude | longitude | accuracy |
|---|---|---|---|---|
| Urban Institute | 2100 M Street NW, Washington DC | 38.91 | -77.05 | street_address |
| Brookings Institution | 1775 Massachusetts Ave NW, Washington, DC 20036 | 38.91 | -77.04 | street_address |
It is possible to pay for additional queries. The current rate is $0.50 per additional 1,000 requests, with a daily limit of 100,000 requests. This requires a more advanced version of library(ggmap) that is only accessible through library(devtools), which sources the package from GitHub instead of CRAN. This version of ggmap is less stable, but it includes register_google(), which registers an API key and can increase the daily quota to 100,000.
ggmap_credentials()
## Google -
## key :
## account_type : standard
## day_limit : 2500
## second_limit : 50
## client :
## signature :
register_google(key = "[your key here]", account_type = "premium", day_limit = 100000)
mutate_geocode() from library(ggmap) is similar to geocode(), but it appends the latitude and longitude to a data frame. Note: this only works with data.frame() and does not work with tibble() or objects of class tbl_df. It also doesn’t return information about the matched address, which is useful for correcting addresses or address types.
mutate_geocode() has the same query limit and characteristics as geocode() (above).
# load the necessary packages
library(tidyverse)
library(ggmap)
institutions <- c("Urban Institute", "Brookings Institution")
addresses <- c("2100 M Street NW, Washington DC",
"1775 Massachusetts Ave NW, Washington, DC 20036")
data.frame(institution = institutions,
address = addresses,
stringsAsFactors = FALSE) %>%
mutate_geocode(address, source = "dsk") %>%
kable(digits = 2, caption = "Output from ggmap::mutate_geocode")
| institution | address | lon | lat |
|---|---|---|---|
| Urban Institute | 2100 M Street NW, Washington DC | -77.05 | 38.91 |
| Brookings Institution | 1775 Massachusetts Ave NW, Washington, DC 20036 | -77.04 | 38.91 |
rm(addresses, institutions)
The US Census Geocoder API is another option for geocoding. It lacks fuzzy matching, so it is less forgiving of address imperfections, and it is slower than the Google Maps Geocoding API, but its quota is based on the number of observations per batch submission instead of a rate of observations per unit of time. This is easier to accommodate programmatically by calling the function on data sets containing fewer than 1,000 observations.
Here’s a useful blog post by Andrew Wheeler that inspired some of the following code.
# load the necessary packages
library(tidyverse)
library(httr)
library(jsonlite)
library(stringr)
# create a tibble with an address
addresses <- tibble(
street = c("1940 Corner Rock Rd."),
city = c("Midlothian"),
state = c("Virginia"),
zip = c("23113")
)
# save the API URL
api_url <- "https://geocoding.geo.census.gov/geocoder/locations/address?"
# function
get_census_address <- function(street, city, state, zip) {
# use GET() to retrieve JSON from the API based on street, city, state,
# zip, and the API benchmark
unretrieved_json <- GET(url = api_url,
query = list(street = street,
city = city,
state = state,
zip = zip,
format = 'json',
benchmark = 4))
# retrieve the contents of the API request as text with a UTF-8 encoding
retrieved_json <- content(unretrieved_json, as = 'text', encoding = "UTF-8")
# simplify JSON into an r object
parsed_json <- fromJSON(retrieved_json, simplifyVector = TRUE)
# extract the complete matched addresses from the r object
matched_addresses <- parsed_json$result$addressMatches
# if there is at least one match, return it
if (length(matched_addresses) > 1) {
temp <- c(matched_addresses['matchedAddress'], matched_addresses['coordinates'][[1]])
return(c(temp[[1]], temp[[2]], temp[[3]]))
}
# if there is no match, return missing values
else {return(c('', NA, NA))}
}
# iterate the function over the list of addresses
output <- addresses %>%
mutate(geocode = pmap(list(street, city, state, zip), get_census_address)) %>%
mutate(new_street = purrr::map(geocode, 1),
longitude = purrr::map(geocode, 2),
latitude = purrr::map(geocode, 3))
output %>%
mutate(row_number = row_number()) %>%
select(row_number, address = new_street, latitude, longitude) %>%
mutate(accuracy = NA,
address = unlist(address),
latitude = unlist(latitude),
longitude = unlist(longitude)) %>%
kable(digits = 2, caption = "Geocoding with US Census Geocoder API")
| row_number | address | latitude | longitude | accuracy |
|---|---|---|---|---|
| 1 | 1940 CORNER ROCK RD, MIDLOTHIAN, VA, 23113 | 37.519318 | -77.633194 | NA |
An optimal strategy for geocoding many addresses could be to break the data into 1,000 observation chunks and submit each chunk to the US Census Geocoder API. Then, use the Google Maps Geocoding API to geocode observations for which the US Census Geocoder API could not find a match. This could all be accomplished with relatively simple R code.
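The chunking half of that strategy can be sketched as follows. This assumes addresses is a data frame with street, city, state, and zip columns and that get_census_address() is defined as in the example above; the second step with the Google Maps Geocoding API would follow the same pattern on the unmatched rows.

```r
# load the necessary packages
library(tidyverse)

# split the data into chunks of at most 1,000 observations
chunks <- addresses %>%
  mutate(chunk = ceiling(row_number() / 1000)) %>%
  split(.$chunk)

# submit each chunk to the US Census Geocoder API, then recombine
geocoded <- chunks %>%
  map(~mutate(.x, geocode = pmap(list(street, city, state, zip),
                                 get_census_address))) %>%
  bind_rows()
```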
mapdist() returns the expected travel distance and expected travel time between two locations. The function is vectorized: if vectors of locations are passed in, it will return a tidy data frame. mode = can be set to “driving”, “walking”, “bicycling”, or “transit”.
The library(ggmap) GitHub page is a good resource for this type of analysis.
# load the necessary packages
library(ggmap)
ggmap::mapdist(from = "Lincoln Monument",
to = "Washington Monument",
mode = "walking",
output = "simple") %>%
kable(digits = 2, caption = "Output from ggmap::mapdist()")
| from | to | m | km | miles | seconds | minutes | hours |
|---|---|---|---|---|---|---|---|
| Lincoln Monument | Washington Monument | 1397 | 1.4 | 0.87 | 1058 | 17.63 | 0.29 |
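Here is a short sketch of the vectorized use of mapdist(); the locations are illustrative, and each from/to pair counts against the API query limit.

```r
# load the necessary packages
library(ggmap)

# pass equal-length vectors to get one row per from/to pair
mapdist(from = c("Lincoln Monument", "Union Station"),
        to = c("Washington Monument", "Washington Monument"),
        mode = "walking",
        output = "simple")
```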
Dot maps are often more informative and convincing than choropleth maps, but data with geographic variables are usually defined by geography level (Census tract, county, etc.) instead of longitude and latitude.
If the data are grouped by small geographies, like Census tracts, coordinates can be randomly sampled from the shapefile for that geography. In this case, the information gained by seeing the geographic distribution often justifies the small amount of sampling error.
The Housing Finance Policy Center’s “An interactive view of the housing boom and bust” uses data with sampled longitudes and latitudes from Census tract shapefiles for millions of mortgages across sixteen years. The following example is based on the code from that project.
The two biggest challenges in sampling coordinates are usually: 1) finding the correct shapefile, because geographies change over time and geographic features like water exist in the middle of geographies; and 2) using and merging the shapefile in R, because shapefiles are usually complicated hierarchical data structures.
IPUMS helps with the first issue because it is a good source for high-quality shapefiles.
The new version of RStudio helps with the second issue because it boasts a new feature that visualizes the branches of the data and helps select desired variables. Click on the shapefile in RStudio to bring up a map of the data. Mouse over the desired variable and click the little logo that appears to the far right.
A line of code that selects the desired variable should appear in the console. For example: shapes@polygons[[1]]@labpt. This trick makes hierarchical data much more manageable.
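Outside of RStudio, str() offers a similar view of the object's structure; this sketch reuses the Washington, D.C. tract shapefile from the earlier example.

```r
# load the necessary packages
library(rgdal)

# read the Washington, D.C. Census tract shapefile used above
shapes <- readOGR(dsn = "mapping/shapefiles/tl_2015_11_tract")

# print the top two levels of the hierarchical S4 structure
str(shapes, max.level = 2)

# S4 slots are accessed with @; for example, the label point of the
# first polygon
shapes@polygons[[1]]@labpt
```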
The following example samples coordinates for mortgage purchase originations by Asian Americans in Rhode Island.
# load the necessary packages
library(tidyverse)
library(sp)
library(rgdal)
library(stringr)
sample_number <- 0
The first function takes one observation with a Census tract number and the number of mortgages in the Census tract and returns a data frame with an observation for each mortgage with longitude and latitude. spsample() from library(sp) does the random sampling and has a number of options for the type of sampling. All of the function options can be seen by submitting ?spsample.
get_coordinates <- function(census.tract, number.of.mortgages) {
# Purpose: takes shapefile and randomly sample longitudes and latitudes in census tracts
# Args:
# census.tract: Census tract ID
# number.of.mortgages: number of mortgages in Census tract
# Output: tibble with longitude and latitude for each mortgage in the Census tract
# Create unique seed number for each random sample
sample_number <<- sample_number + 1
# Set seeds so pseudo-random sampling is reproducible
set.seed(max(sample_number, 1))
# Create a temporary data frame with the sampled coordinates
temp <- spsample(x = tracts_shapefile@polygons[[census.tract]],
n = number.of.mortgages,
type = "random",
iter = 100)
# If temp has sampled coordinates, then split into longitude and latitude and return
# Else, assign and return NAs
if (!is.null(temp)) {
longitude <- temp@coords[, 1]
latitude <- temp@coords[, 2]
return(tibble(longitude = longitude, latitude = latitude))
} else {
return(tibble(longitude = NA, latitude = NA))
}
}
The second function subsets the full data set by loan type and race/ethnicity and then iterates get_coordinates(), the function from above, across the subset.
The code looks more complicated than it really is. First, the quasi-quotation is necessary because of tidy evaluation. Second, the function creates nested vectors that need to be unnested with unnest().
latlong_function <- function(loan_type, race_ethnicity, formula, race.ethnicity, loan.type) {
# Purpose: Take df with one observation per Census tract and return df with one observation per 10 mortgages with coordinates
# Args:
# loan_type: purchase or refinance
# race_ethnicity: total, aa, his, asian, nhw
# formula: total > 0, aa > 0, his > 0, asian > 0, nhw > 0
# race.ethnicity: character label for the race/ethnicity group
# loan.type: character label for the loan type
# Output: one of ten combinations of race and mortgage type
# Create quasi-quoted arguments
race_ethnicity <- enquo(race_ethnicity)
formula <- enquo(formula)
# Print statements to keep track of progress
print(formula)
print(race_ethnicity)
print(exists("tracts_shapefile"))
# Match the positions of the Census tracts and subset by race/ethnicity
temp <- loan_type %>%
group_by(tract) %>%
mutate(position = match(tract,
as.numeric(as.character(tracts_shapefile@data$GEOID)))) %>%
filter(!!formula) %>%
arrange(position)
# If the subsetted data frame has > zero rows, then run get_coordinates
# Else, don't run get_coordinates and go to the next observation
if (nrow(temp) > 0) {
temp <- temp %>%
mutate(coordinates = map2(.x = unique(position),
.y = !!race_ethnicity,
.f = get_coordinates)) %>%
mutate(raceethnicity = race.ethnicity,
loantype = loan.type) %>%
unnest()
return(temp)
} else {
rm(temp)
}
}
The third function reads shapefiles, changes the map projection of the shapefiles, and runs latlong_function() on each race/ethnicity. It takes a shapefile name and returns a tidy tibble with observations for every ~10 mortgages.
readR <- function(shapefile) {
# Purpose: Reads the state shapefile, set the map projection, and sample
# for total plus the four race/ethnicity groups
# Args:
# shapefile: Name of the shapefile for sampling
# Output: Tibble with observations for every ten mortgages
# Read the relevant shapefile
tracts_shapefile <<- readOGR(paste0("mapping/shapefiles/", shapefile))
# Set the map projection for the shapefile
tracts_shapefile <<- spTransform(tracts_shapefile, CRS("+init=epsg:4326"))
# Confirm the shapefile exists
print(exists("tracts_shapefile"))
# Create data frames for each race/ethnicity
purchase_total <- latlong_function(loan_type = purchase,
race_ethnicity = total,
formula = total > 0,
race.ethnicity = "total",
loan.type = "purchase")
purchase_aa <- latlong_function(loan_type = purchase,
race_ethnicity = aa,
formula = aa > 0,
race.ethnicity = "aa",
loan.type = "purchase")
purchase_his <- latlong_function(loan_type = purchase,
race_ethnicity = his,
formula = his > 0,
race.ethnicity = "his",
loan.type = "purchase")
purchase_asian <- latlong_function(loan_type = purchase,
race_ethnicity = asian,
formula = asian > 0,
race.ethnicity = "asian",
loan.type = "purchase")
purchase_nhw <- latlong_function(loan_type = purchase,
race_ethnicity = nhw,
formula = nhw > 0,
race.ethnicity = "nhw",
loan.type = "purchase")
# combine the five data frames into one data frame
full_dataset <- bind_rows(purchase_aa, purchase_asian, purchase_his, purchase_nhw, purchase_total)
rm(purchase_aa, purchase_asian, purchase_his, purchase_nhw, purchase_total)
return(full_dataset)
}
The above code loads packages and defines functions. Nothing actually happens until the functions are called. The following code chunk actually calls the function.
readR() can be run on just one shapefile, like readR("tl_2012"), or it can be iterated across a vector of shapefile names using map() from library(purrr). This example uses the former. The actual program used the latter and iterated across shapefiles and data from 2000 to 2016; each year had its own data frame, and three unique shapefiles were used because Census tracts change over time.
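The iterated version can be sketched as follows; the shapefile names here are illustrative placeholders, and the sketch assumes readR() and the data it reads are set up as above.

```r
# load the necessary packages
library(tidyverse)

# hypothetical vector of shapefile names
shapefiles <- c("tl_2000", "tl_2010", "tl_2012")

# run readR() on each shapefile and stack the results into one tibble
sampled_coordinates <- shapefiles %>%
  purrr::map(readR) %>%
  bind_rows()
```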
This code loads the data using read_csv() and runs the function readR() on the 2012 shapefile for Rhode Island.
purchase <- read_csv("mapping/data/hmdamap2012_10_purch.csv",
col_types = cols(
tract = col_double(),
year1 = col_integer(),
tract_T = col_character(),
aa = col_integer(),
his = col_integer(),
nhw = col_integer(),
asian = col_integer(),
total = col_integer())) %>%
filter(str_detect(tract, "^44")) %>%
filter(asian > 0)
rhode_island <- readR("tl_2012")
Once the coordinates are sampled, it’s simple to create a dot density map using the methods from above.
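For example, a minimal sketch that plots the sampled coordinates from the chunk above over a state outline; the title and styling are illustrative.

```r
# load the necessary packages
library(tidyverse)

# layer the sampled coordinates on top of the Rhode Island outline
ggplot() +
  geom_polygon(data = map_data("state", region = "rhode island"),
               mapping = aes(x = long, y = lat, group = group),
               color = "#ffffff", fill = "#d2d2d2", size = 0.2) +
  geom_point(data = rhode_island, aes(longitude, latitude),
             color = "#1696d2", size = 0.5, alpha = 0.5) +
  coord_map(projection = "albers", lat0 = 39, lat1 = 45) +
  labs(title = "Mortgage Purchases by Asian Americans",
       subtitle = "Rhode Island, 2012")
```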
This section is a repository of advanced maps made in R.
This plot uses grid.arrange() from library(gridExtra) to add a bar plot of the same data below the tile grid map.
library(tidyverse)
library(forcats)
library(gridExtra)
library(fiftystater)
chip <- read_csv("mapping/data/chip-enrollment.csv")
# Create data frame with vector boundaries
data("fifty_states")
# create groupings for states
chip <- chip %>%
mutate(state = tolower(State)) %>%
arrange(`CHIP Enrollment`) %>%
mutate(enrollment_group = c(rep("Group 1", 11),
rep("Group 2", 10),
rep("Group 3", 10),
rep("Group 4", 10),
rep("Group 5", 10)))
# create a vector with 5 hexadecimal colors for the 5 groups
urban_colors <- c("#cfe8f3", "#a2d4ec", "#46abdb", "#12719e", "#062635")
# Plot!
chip_map <- ggplot(chip, aes(map_id = state)) +
geom_map(aes(fill = enrollment_group), map = fifty_states,
color = "#ffffff", size = 0.25) +
expand_limits(x = fifty_states$long, y = fifty_states$lat) +
scale_fill_manual(values = urban_colors) +
coord_map("albers", lat0 = 39, lat1 = 45) +
labs(title = "State CHIP Enrollment") +
theme(axis.text = element_blank(),
axis.ticks = element_blank(),
axis.title = element_blank(),
panel.grid = element_blank(),
axis.line = element_blank(),
plot.margin = margin(t = 0, r = 0, b = 0, l = 0))
# bar plot
chip_bar_plot <- chip %>%
ggplot(aes(x = fct_reorder(state_abbreviation, `CHIP Enrollment`),
y = `CHIP Enrollment`, fill = enrollment_group)) +
geom_bar(stat = "identity") +
scale_fill_manual(values = urban_colors) +
scale_y_continuous(expand = c(0, 0)) +
labs(x = NULL,
y = NULL,
caption = "Urban Institute") +
theme(panel.grid = element_blank(),
axis.text.y = element_blank(),
axis.ticks = element_blank(),
axis.line = element_blank(),
axis.text = element_text(size = 6L),
legend.position = "none",
plot.margin = margin(t = 0, r = 0, b = 0, l = 0))
grid.arrange(chip_map, chip_bar_plot, ncol = 1, heights = c(2, 1))
sessionInfo()
## R version 3.4.3 (2017-11-30)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS High Sierra 10.13.3
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] grid stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] jsonlite_1.5 httr_1.3.1 ggmap_2.7
## [4] geofacet_0.1.5 broom_0.4.3 htmltools_0.3.6
## [7] tidycensus_0.4.1 viridis_0.5.0 viridisLite_0.3.0
## [10] leaflet_1.1.0 sf_0.6-0 USAboundaries_0.3.0
## [13] rgdal_1.2-16 sp_1.2-7 mapdata_2.2-6
## [16] maps_3.2.0 fiftystater_1.0.1 gridExtra_2.3
## [19] kableExtra_0.7.0 RXKCD_1.8-2 wooldridge_1.2.0
## [22] RColorBrewer_1.1-2 AmesHousing_0.0.3 fivethirtyeight_0.3.0
## [25] waffle_0.7.0 ggridges_0.4.1 gapminder_0.3.0
## [28] hexbin_1.27.2 bindrcpp_0.2 ggrepel_0.7.0
## [31] extrafont_0.17 forcats_0.2.0 stringr_1.2.0
## [34] dplyr_0.7.4 purrr_0.2.4 readr_1.1.1
## [37] tidyr_0.8.0 tibble_1.4.2 ggplot2_2.2.1
## [40] tidyverse_1.2.1 knitr_1.19
##
## loaded via a namespace (and not attached):
## [1] colorspace_1.3-2 rjson_0.2.15 class_7.3-14
## [4] rprojroot_1.3-2 rstudioapi_0.7 lubridate_1.7.1
## [7] xml2_1.2.0 mnormt_1.5-5 Rttf2pt1_1.3.5
## [10] png_0.1-7 shiny_1.0.5 mapproj_1.2-5
## [13] compiler_3.4.3 backports_1.1.2 assertthat_0.2.0
## [16] Matrix_1.2-12 lazyeval_0.2.1 cli_1.0.0
## [19] tools_3.4.3 gtable_0.2.0 glue_1.2.0
## [22] reshape2_1.4.3 rappdirs_0.3.1 Rcpp_0.12.15
## [25] cellranger_1.1.0 RJSONIO_1.3-0 nlme_3.1-131
## [28] udunits2_0.13 extrafontdb_1.0 tigris_0.6.2
## [31] crosstalk_1.0.0 psych_1.7.8 proto_1.0.0
## [34] rvest_0.3.2 mime_0.5 MASS_7.3-48
## [37] scales_0.5.0 hms_0.4.1 parallel_3.4.3
## [40] yaml_2.1.16 curl_3.1 geosphere_1.5-7
## [43] stringi_1.1.6 highr_0.6 maptools_0.9-2
## [46] e1071_1.6-8 bitops_1.0-6 RgoogleMaps_1.4.1
## [49] rlang_0.1.6 pkgconfig_2.0.1 evaluate_0.10.1
## [52] lattice_0.20-35 bindr_0.1 htmlwidgets_1.0
## [55] labeling_0.3 tidyselect_0.2.3 plyr_1.8.4
## [58] magrittr_1.5 R6_2.2.2 DBI_0.7
## [61] pillar_1.1.0 haven_1.1.1 foreign_0.8-69
## [64] mgcv_1.8-23 units_0.5-1 modelr_0.1.1
## [67] crayon_1.3.4 uuid_0.1-2 utf8_1.1.3
## [70] rmarkdown_1.8 jpeg_0.1-8 readxl_1.0.0
## [73] digest_0.6.15 classInt_0.1-24 xtable_1.8-2
## [76] httpuv_1.3.5 munsell_0.4.3